Explainable Boosting Machine

See the reference paper for full details [1].

Summary

Explainable Boosting Machine (EBM) is a tree-based, cyclic gradient boosting Generalized Additive Model with automatic interaction detection. EBMs are often as accurate as state-of-the-art blackbox models while remaining completely interpretable. Although EBMs are often slower to train than other modern algorithms, EBMs are extremely compact and fast at prediction time.

How it Works

As part of the framework, InterpretML also includes a new interpretability algorithm – the Explainable Boosting Machine (EBM). EBM is a glassbox model, designed to have accuracy comparable to state-of-the-art machine learning methods like Random Forest and Boosted Trees, while being highly intelligible and explainable. EBM is a generalized additive model (GAM) of the form:

\[ g(E[y]) = \beta_0 + \sum f_j(x_j) \]

where \(g\) is the link function that adapts the GAM to different settings such as regression or classification.
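As a toy illustration (not the library's internals), the inverse of the link function \(g\) maps the additive score back to the output scale: the identity for regression, the logistic sigmoid for binary classification. A minimal sketch:

```python
import math

def inverse_link(score, task):
    """Map an additive score g(E[y]) back to the prediction scale."""
    if task == "regression":        # identity link
        return score
    if task == "classification":    # logit link, so the inverse is the sigmoid
        return 1.0 / (1.0 + math.exp(-score))
    raise ValueError(task)

# The same additive score is interpreted differently per setting:
score = 0.4  # beta_0 + sum_j f_j(x_j)
print(inverse_link(score, "regression"))      # 0.4
print(inverse_link(score, "classification"))  # ~0.599
```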

EBM has a few major improvements over traditional GAMs [2]. First, EBM learns each feature function \(f_j\) using modern machine learning techniques such as bagging and gradient boosting. The boosting procedure is carefully restricted to train on one feature at a time, cycling through the features in round-robin fashion with a very low learning rate so that feature order does not matter. This round-robin cycling mitigates the effects of co-linearity and lets EBM learn the best feature function \(f_j\) for each feature, showing how each feature contributes to the model’s prediction. Second, EBM can automatically detect and include pairwise interaction terms of the form:
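The round-robin boosting idea can be sketched as follows. This is a deliberately simplified toy for regression with squared loss, not the library's implementation (which uses shallow trees, bagging, and early stopping): each round visits every feature in turn and nudges that feature's piecewise-constant function toward the current residuals with a small learning rate.

```python
import numpy as np

def fit_cyclic_gam(X, y, n_bins=8, rounds=200, lr=0.05):
    """Toy cyclic boosting for a regression GAM: one piecewise-constant
    function per feature, updated round-robin on the residuals."""
    n, d = X.shape
    # Bin each feature; f_j becomes a lookup table of per-bin contributions.
    edges = [np.quantile(X[:, j], np.linspace(0, 1, n_bins + 1)[1:-1])
             for j in range(d)]
    bins = np.stack([np.searchsorted(edges[j], X[:, j]) for j in range(d)], axis=1)
    f = np.zeros((d, n_bins))
    intercept = y.mean()
    pred = np.full(n, intercept)
    for _ in range(rounds):
        for j in range(d):                  # round-robin over features
            resid = y - pred
            for b in range(n_bins):         # small step toward mean residual
                mask = bins[:, j] == b
                if mask.any():
                    step = lr * resid[mask].mean()
                    f[j, b] += step
                    pred[mask] += step
    return intercept, edges, f, pred

# Synthetic additive data: y = x0^2 + 3*x1 + noise
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(500, 2))
y = X[:, 0] ** 2 + 3 * X[:, 1] + rng.normal(0, 0.1, 500)
intercept, edges, f, pred = fit_cyclic_gam(X, y)
```

Because each pass takes only a small step per feature, the order in which features are visited has little effect on the final shape functions.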

\[ g(E[y]) = \beta_0 + \sum f_i(x_i) + \sum f_{i,j}(x_i,x_j) \]

which further increases accuracy while maintaining intelligibility. EBM is a fast implementation of the GA2M algorithm [1], written in C++ and Python. The implementation is parallelizable, and takes advantage of joblib to provide multi-core and multi-machine parallelization. The algorithmic details for the training procedure, selection of pairwise interaction terms, and case studies can be found in [1, 3, 4].
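Conceptually, a fitted pairwise term \(f_{i,j}\) is just a 2D lookup table over the binned pair of feature values, added onto the 1D main-effect tables. A toy sketch with made-up numbers (not fitted values from the library):

```python
import numpy as np

# Hypothetical fitted pieces of a tiny GA2M: per-feature 1D tables plus
# one pairwise 2D table, all indexed by feature bin.
intercept = 0.5
f1 = np.array([-0.2, 0.0, 0.3])          # f_1(x_1), 3 bins
f2 = np.array([0.1, -0.1])               # f_2(x_2), 2 bins
f12 = np.array([[0.0, 0.05],             # f_{1,2}(x_1, x_2): 3x2 bins
                [-0.02, 0.0],
                [0.1, -0.1]])

def score(b1, b2):
    """Additive score for a sample whose features fall in bins (b1, b2)."""
    return intercept + f1[b1] + f2[b2] + f12[b1, b2]

print(score(2, 0))  # 0.5 + 0.3 + 0.1 + 0.1 = 1.0
```

The pairwise table stays intelligible because \(f_{i,j}\) depends on only two features and can be rendered as a heatmap.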

EBMs are highly intelligible because the contribution of each feature to a final prediction can be visualized and understood by plotting \(f_j\). Because EBM is an additive model, each feature contributes to predictions in a modular way that makes it easy to reason about the contribution of each feature to the prediction.

To make individual predictions, each function \(f_j\) acts as a lookup table per feature, and returns a term contribution. These term contributions are simply added up, and passed through the link function \(g\) to compute the final prediction. Because of the modularity (additivity), term contributions can be sorted and visualized to show which features had the most impact on any individual prediction.
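The prediction step above can be sketched as follows. This is a toy stand-in, not the library's code: hypothetical per-feature lookup tables return term contributions, which are summed, passed through the (logistic) link inverse, and sorted to form a local explanation.

```python
import math

# Hypothetical fitted feature functions as lookup tables (bin -> contribution).
terms = {
    "Age":          {0: -0.8, 1: 0.1, 2: 0.6},
    "HoursPerWeek": {0: -0.3, 1: 0.4},
    "CapitalGain":  {0: -0.1, 1: 1.2},
}
intercept = -1.0

def predict_proba(binned_sample):
    """Sum term contributions, apply the link inverse, rank the terms."""
    contribs = {name: terms[name][b] for name, b in binned_sample.items()}
    score = intercept + sum(contribs.values())
    prob = 1.0 / (1.0 + math.exp(-score))    # logistic link inverse
    # Sort by |contribution| to see which terms drove this prediction.
    ranked = sorted(contribs.items(), key=lambda kv: abs(kv[1]), reverse=True)
    return prob, ranked

prob, ranked = predict_proba({"Age": 2, "HoursPerWeek": 1, "CapitalGain": 1})
# ranked[0] is ("CapitalGain", 1.2), the biggest driver of this prediction
```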

To keep the individual terms additive, EBM pays an additional training cost, making it somewhat slower than similar methods. However, because making predictions involves simple additions and lookups inside of the feature functions \(f_j\), EBMs are one of the fastest models to execute at prediction time. EBM’s light memory usage and fast predict times makes it particularly attractive for model deployment in production.

If you prefer video, a conceptual overview of the algorithm is available below: Video for explaining how EBM works.

Code Example

The following code trains an EBM classifier on the adult income dataset and produces visualizations for both global and local explanations.

from interpret import set_visualize_provider
from interpret.provider import InlineProvider
set_visualize_provider(InlineProvider())
import pandas as pd
from sklearn.model_selection import train_test_split

from interpret.glassbox import ExplainableBoostingClassifier
from interpret import show

df = pd.read_csv(
    "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data",
    header=None)
df.columns = [
    "Age", "WorkClass", "fnlwgt", "Education", "EducationNum",
    "MaritalStatus", "Occupation", "Relationship", "Race", "Gender",
    "CapitalGain", "CapitalLoss", "HoursPerWeek", "NativeCountry", "Income"
]
df = df.sample(frac=0.05)  # subsample to keep the demo fast
train_cols = df.columns[0:-1]
label = df.columns[-1]
X = df[train_cols]
y = df[label]

seed = 1
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=seed)

ebm = ExplainableBoostingClassifier(random_state=seed)
ebm.fit(X_train, y_train)

ebm_global = ebm.explain_global()
show(ebm_global)

ebm_local = ebm.explain_local(X_test[:5], y_test[:5])
show(ebm_local)

Bibliography

[1]

Yin Lou, Rich Caruana, Johannes Gehrke, and Giles Hooker. Accurate intelligible models with pairwise interactions. In Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining, 623–631. 2013.

[2]

Trevor Hastie and Robert Tibshirani. Generalized additive models: some applications. Journal of the American Statistical Association, 82(398):371–386, 1987.

[3]

Yin Lou, Rich Caruana, and Johannes Gehrke. Intelligible models for classification and regression. In Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining, 150–158. 2012.

[4]

Rich Caruana, Yin Lou, Johannes Gehrke, Paul Koch, Marc Sturm, and Noemie Elhadad. Intelligible models for healthcare: predicting pneumonia risk and hospital 30-day readmission. In Proceedings of the 21th ACM SIGKDD international conference on knowledge discovery and data mining, 1721–1730. 2015.

API

ExplainableBoostingClassifier

class interpret.glassbox.ExplainableBoostingClassifier(feature_names=None, feature_types=None, max_bins=256, max_interaction_bins=32, binning='quantile', mains='all', interactions=10, outer_bags=8, inner_bags=0, learning_rate=0.01, validation_size=0.15, early_stopping_rounds=50, early_stopping_tolerance=0.0001, max_rounds=5000, min_samples_leaf=2, max_leaves=3, n_jobs=-2, random_state=42)

Explainable Boosting Classifier. The arguments will change in a future release; watch the changelog.

Parameters
  • feature_names – List of feature names.

  • feature_types – List of feature types.

  • max_bins – Max number of bins per feature for pre-processing stage.

  • max_interaction_bins – Max number of bins per feature for pre-processing stage on interaction terms. Only used if interactions is non-zero.

  • binning – Method to bin values for pre-processing. Choose “uniform”, “quantile” or “quantile_humanized”.

  • mains – Features to be trained on in main effects stage. Either “all” or a list of feature indexes.

  • interactions – Interactions to be trained on. Either a list of lists of feature indices, or an integer for number of automatically detected interactions. Interactions are forcefully set to 0 for multiclass problems.

  • outer_bags – Number of outer bags.

  • inner_bags – Number of inner bags.

  • learning_rate – Learning rate for boosting.

  • validation_size – Validation set size for boosting.

  • early_stopping_rounds – Number of rounds of no improvement to trigger early stopping.

  • early_stopping_tolerance – Tolerance that dictates the smallest delta required to be considered an improvement.

  • max_rounds – Number of rounds for boosting.

  • min_samples_leaf – Minimum number of cases for tree splits used in boosting.

  • max_leaves – Maximum leaf nodes used in boosting.

  • n_jobs – Number of jobs to run in parallel.

  • random_state – Random state.

decision_function(X)

Predict scores from model before calling the link function.

Parameters

X – Numpy array for samples.

Returns

The sum of the additive term contributions.

explain_global(name=None)

Provides global explanation for model.

Parameters

name – User-defined explanation name.

Returns

An explanation object, visualizing feature-value pairs as horizontal bar chart.

explain_local(X, y=None, name=None)

Provides local explanations for provided samples.

Parameters
  • X – Numpy array for X to explain.

  • y – Numpy vector for y to explain.

  • name – User-defined explanation name.

Returns

An explanation object, visualizing feature-value pairs for each sample as horizontal bar charts.

explainer_type = 'model'

Public facing EBM classifier.

fit(X, y, sample_weight=None)

Fits model to provided samples.

Parameters
  • X – Numpy array for training samples.

  • y – Numpy array as training labels.

  • sample_weight – Optional array of weights per sample. Should be same length as X and y.

Returns

Itself.

get_params(deep=True)

Get parameters for this estimator.

Parameters

deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns

params – Parameter names mapped to their values.

Return type

dict

predict(X)

Predicts on provided samples.

Parameters

X – Numpy array for samples.

Returns

Predicted class label per sample.

predict_and_contrib(X, output='probabilities')

Predicts on provided samples, returning predictions and explanations for each sample.

Parameters
  • X – Numpy array for samples.

  • output – Prediction type to output (i.e. one of ‘probabilities’, ‘logits’, ‘labels’)

Returns

Predictions and local explanations for each sample.

predict_proba(X)

Probability estimates on provided samples.

Parameters

X – Numpy array for samples.

Returns

Probability estimate of sample for each class.

score(X, y, sample_weight=None)

Return the mean accuracy on the given test data and labels.

In multi-label classification, this is the subset accuracy which is a harsh metric since you require for each sample that each label set be correctly predicted.

Parameters
  • X (array-like of shape (n_samples, n_features)) – Test samples.

  • y (array-like of shape (n_samples,) or (n_samples, n_outputs)) – True labels for X.

  • sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.

Returns

score – Mean accuracy of self.predict(X) wrt. y.

Return type

float

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters

**params (dict) – Estimator parameters.

Returns

self – Estimator instance.

Return type

estimator instance

ExplainableBoostingRegressor

class interpret.glassbox.ExplainableBoostingRegressor(feature_names=None, feature_types=None, max_bins=256, max_interaction_bins=32, binning='quantile', mains='all', interactions=10, outer_bags=8, inner_bags=0, learning_rate=0.01, validation_size=0.15, early_stopping_rounds=50, early_stopping_tolerance=0.0001, max_rounds=5000, min_samples_leaf=2, max_leaves=3, n_jobs=-2, random_state=42)

Explainable Boosting Regressor. The arguments will change in a future release; watch the changelog.

Parameters
  • feature_names – List of feature names.

  • feature_types – List of feature types.

  • max_bins – Max number of bins per feature for pre-processing stage on main effects.

  • max_interaction_bins – Max number of bins per feature for pre-processing stage on interaction terms. Only used if interactions is non-zero.

  • binning – Method to bin values for pre-processing. Choose “uniform”, “quantile”, or “quantile_humanized”.

  • mains – Features to be trained on in main effects stage. Either “all” or a list of feature indexes.

  • interactions – Interactions to be trained on. Either a list of lists of feature indices, or an integer for number of automatically detected interactions.

  • outer_bags – Number of outer bags.

  • inner_bags – Number of inner bags.

  • learning_rate – Learning rate for boosting.

  • validation_size – Validation set size for boosting.

  • early_stopping_rounds – Number of rounds of no improvement to trigger early stopping.

  • early_stopping_tolerance – Tolerance that dictates the smallest delta required to be considered an improvement.

  • max_rounds – Number of rounds for boosting.

  • min_samples_leaf – Minimum number of cases for tree splits used in boosting.

  • max_leaves – Maximum leaf nodes used in boosting.

  • n_jobs – Number of jobs to run in parallel.

  • random_state – Random state.

decision_function(X)

Predict scores from model before calling the link function.

Parameters

X – Numpy array for samples.

Returns

The sum of the additive term contributions.

explain_global(name=None)

Provides global explanation for model.

Parameters

name – User-defined explanation name.

Returns

An explanation object, visualizing feature-value pairs as horizontal bar chart.

explain_local(X, y=None, name=None)

Provides local explanations for provided samples.

Parameters
  • X – Numpy array for X to explain.

  • y – Numpy vector for y to explain.

  • name – User-defined explanation name.

Returns

An explanation object, visualizing feature-value pairs for each sample as horizontal bar charts.

explainer_type = 'model'

Public facing EBM regressor.

fit(X, y, sample_weight=None)

Fits model to provided samples.

Parameters
  • X – Numpy array for training samples.

  • y – Numpy array as training labels.

  • sample_weight – Optional array of weights per sample. Should be same length as X and y.

Returns

Itself.

get_params(deep=True)

Get parameters for this estimator.

Parameters

deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns

params – Parameter names mapped to their values.

Return type

dict

predict(X)

Predicts on provided samples.

Parameters

X – Numpy array for samples.

Returns

Predicted value per sample.

predict_and_contrib(X)

Predicts on provided samples, returning predictions and explanations for each sample.

Parameters

X – Numpy array for samples.

Returns

Predictions and local explanations for each sample.

score(X, y, sample_weight=None)

Return the coefficient of determination \(R^2\) of the prediction.

The coefficient \(R^2\) is defined as \((1 - \frac{u}{v})\), where \(u\) is the residual sum of squares ((y_true - y_pred) ** 2).sum() and \(v\) is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). The best possible score is 1.0 and it can be negative (because the model can be arbitrarily worse). A constant model that always predicts the expected value of y, disregarding the input features, would get a \(R^2\) score of 0.0.
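Plugging the definition above into a small numpy check (made-up numbers, for illustration only):

```python
import numpy as np

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

u = ((y_true - y_pred) ** 2).sum()          # residual sum of squares
v = ((y_true - y_true.mean()) ** 2).sum()   # total sum of squares
r2 = 1 - u / v
print(r2)  # ~0.9486

# A constant model predicting y_true.mean() has u == v, hence R^2 == 0
r2_const = 1 - v / v
```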

Parameters
  • X (array-like of shape (n_samples, n_features)) – Test samples. For some estimators this may be a precomputed kernel matrix or a list of generic objects instead with shape (n_samples, n_samples_fitted), where n_samples_fitted is the number of samples used in the fitting for the estimator.

  • y (array-like of shape (n_samples,) or (n_samples, n_outputs)) – True values for X.

  • sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.

Returns

score – \(R^2\) of self.predict(X) wrt. y.

Return type

float

Notes

The \(R^2\) score used when calling score on a regressor uses multioutput='uniform_average' from version 0.23 to keep consistent with default value of r2_score(). This influences the score method of all the multioutput regressors (except for MultiOutputRegressor).

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters

**params (dict) – Estimator parameters.

Returns

self – Estimator instance.

Return type

estimator instance